ABSTRACT

This study compared the reliabilities of multiple choice and
true-false tests constructed to measure the same objectives. The
impetus for this study came
from the research reported by Ebel (1971) on the same topic. Subjects
were selected from six public high schools. Three phases of testing 
were required for instrument development and data gathering. Phase I 
involved collecting item analysis data for one item conversion method 
and Phase II was used to try out the true-false items. The final 
phase of testing included 1018 students responding to eight final 
test forms. The social studies and natural science multiple choice 
items employed in this study appeared in a widely used battery of 
achievement tests. The original 70-item multiple choice tests, SM
(social studies) and NM (natural science), were each administered to a
minimum of 100 subjects. The four true-false test forms were each
administered to a minimum of 50 subjects in Phase II. The eight final
test forms varied according to subject matter, item conversion
method, and item form order. The results of this study support the
notion that students respond to more true-false than multiple choice 
items in a given period of time. However, the data indicate that the 
multiple choice tests were more reliable though they tended to 
measure the same thing that the true-false tests measured. (CK) 

Purpose of the Study

This study was designed to compare the reliabilities of multiple choice and
true-false tests that were constructed to measure the same objectives. A
second purpose was to determine if multiple choice tests and the true-false
tests derived from them measured the same thing.

Background 

The impetus for this study came from the research reported by Ebel (1971) on
the same topic. His data, in general, supported the notion that true-false
tests can be just as reliable as multiple choice tests and that both measure
relatively the same thing. Two assumptions made in the original study were
eliminated in the present study. First, data were gathered to determine the
ratio of the number of true-false to multiple choice items attempted by
examinees in a fixed period of time. This ratio was required to adjust the
K-R20s of the true-false tests to equate testing time. The ratio was estimated
(2 to 1) in the original study. The second change was to use a systematic and
relatively objective procedure for converting test items from multiple choice
to true-false form. Two different conversion methods were employed in the
present study. In the earlier study, development of the true-false items
involved considerable subjective judgment on the part of the item writer.

The bulk of the studies reported in the literature that deal with the
reliability and validity of tests of varying item form were done in the late
1920's and early 1930's, when objective examinations began to flourish
(Frisbie, 1971).

Sample 

The subjects who participated in this study were selected from classrooms in
six public high schools in Michigan. Classrooms and schools cooperated on a
voluntary basis but were originally approached so that the final sample might
represent a cross section of non-urban high school students in science and
social studies achievement.

Three phases of testing were required for instrument development and data
gathering. Phase I involved collecting item analysis data for one item
conversion method, and Phase II was used to try out the true-false items. The
final phase of testing included 1018 students responding to eight final test
forms (see Table 1).

Instrumentation 

The social studies and natural science multiple choice items employed in this
study appeared in a widely used battery of achievement tests. The items were
written to measure knowledge and understanding of concepts that are part of
the current secondary school curriculum.

The judgmental conversion method (J) required secondary science and social
studies teachers to judge the quality of the multiple choice distractors from
the items in their respective areas of expertise. They were directed to select
the distractor for each item that appeared to be most plausible for making a
false statement with the stem. The use of this method resulted in 41 false
and 29 true statements in social studies and 45 false and 25 true statements in
natural science. The two 70-item true-false tests were labeled forms SJ
(social studies) and NJ (natural science).

The original 70-item multiple choice tests, SM (social studies) and NM (natural
science), were each administered to a minimum of 100 subjects in classrooms
similar to those involved in the final (Phase III) part of the study. Item
analysis data were used to calculate a lower-upper discrimination index for each
item response alternative. The foil with the largest lower-upper difference for
each item was used to make a false statement with the stem. The discrimination
conversion method (D) furnished 37 false and 33 true statements for form SD
(social studies) and 37 false and 33 true statements for form ND (natural
science).
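
The foil selection rule of the discrimination method can be sketched in a few lines. This is only an illustration, assuming the index for an alternative is the proportion of the lower criterion group choosing it minus the proportion of the upper group choosing it; the function names, group sizes, and response counts are hypothetical and do not come from the study.

```python
def lower_upper_index(lower_count, upper_count, lower_n, upper_n):
    """Proportion of the lower group choosing an alternative minus the
    proportion of the upper group choosing it (assumed definition)."""
    return lower_count / lower_n - upper_count / upper_n


def pick_foil(choices, keyed, lower_n, upper_n):
    """Return the distractor with the largest lower-minus-upper difference.

    choices maps each alternative label to (lower_count, upper_count);
    keyed is the label of the correct answer, which is never eligible.
    """
    foils = {
        label: lower_upper_index(lo, up, lower_n, upper_n)
        for label, (lo, up) in choices.items()
        if label != keyed
    }
    return max(foils, key=foils.get)


# Hypothetical item: 27 examinees in each criterion group, "B" is keyed.
item = {"A": (9, 2), "B": (8, 21), "C": (6, 3), "D": (4, 1)}
best_foil = pick_foil(item, keyed="B", lower_n=27, upper_n=27)
print(best_foil)
```

In this made-up item, alternative A draws far more lower-group than upper-group examinees, so it would be paired with the stem to form the false statement.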

The four true-false test forms were each administered to a minimum of 50
subjects in Phase II of testing. Three items were slightly revised based on this
try-out, and all true-false items were then incorporated in eight forms for
final testing.

The eight final test forms varied according to subject matter, item conversion
method, and item form order. The composition of these forms is indicated by
Figure 1. Form SJA, for example, consisted of items 1-35 of the original
multiple choice form (SM) and items 36-70 of form SJ (social studies items
converted by the judgmental method). Form SDB comprised items 1-35 of
form SD and items 36-70 of form SM.

The four final forms in each subject matter area were administered to randomly
selected students in classrooms. Subjects were stopped after eight minutes
of testing and asked to circle the number of the item on which they were
working. These data were used to determine the amount of time required to
respond to items of each of the two forms.

Results

A K-R20 was computed for each of the two subtests in each of the eight final
test forms. The reliabilities of the true-false subtests were then adjusted
to permit comparison of the two item forms on the basis of equal amounts of
testing time, rather than on the basis of equal numbers of items. Since the
subjects in this study responded to 25.59 true-false items in eight minutes,
but only 17.04 multiple choice items in the same amount of time, the value 1.5
was used for n in the Spearman-Brown formula. The reliability coefficients
for the 16 subtests are recorded in Table 2.
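
The two computations in this paragraph can be sketched as follows. The code is an illustration under stated assumptions, not the authors' procedure: `kr20` uses the usual textbook form of Kuder-Richardson formula 20 with the population variance of total scores, `spearman_brown` is the standard lengthening formula, and both function names are hypothetical. The values .654 and .739 are the original and adjusted true-false reliabilities listed for form SJB in Table 2.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores,
    one row per examinee."""
    n_people = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    # Population variance of total scores (divide by N, not N - 1).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)


def spearman_brown(reliability, n):
    """Projected reliability of a test lengthened by a factor of n."""
    return n * reliability / (1 + (n - 1) * reliability)


# Items attempted in eight minutes, as reported in the text.
ratio = 25.59 / 17.04
n = round(ratio, 1)

# Original true-false K-R20 for form SJB, taken from Table 2.
adjusted_sjb = spearman_brown(0.654, n)
print(n, round(adjusted_sjb, 3))
```

With n = 1.5 the projection for form SJB comes out at .739, which agrees with the adjusted value in Table 2.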

The difference between reliability coefficients for the subtests using the two
item forms (multiple choice vs. adjusted true-false) was tested for statistical
significance using a paired t test. The difference in favor of the multiple
choice items was significant beyond the .01 level. No significant difference
(p less than .50) was found between the reliabilities of the true-false tests
derived by the two different methods of item conversion.
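
A paired t statistic of the kind described here can be computed as in the sketch below. The function name is hypothetical, and the reliabilities are invented for illustration; they are NOT the values in Table 2.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t, degrees of freedom) for paired observations."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 in the denominator
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical reliabilities for four forms (illustration only).
mc = [0.80, 0.85, 0.83, 0.86]
tf = [0.78, 0.74, 0.70, 0.73]
t_stat, df = paired_t(mc, tf)
print(round(t_stat, 2), df)
```

The resulting t is then compared with the critical value for the given degrees of freedom; with eight forms, as in this study, there would be seven.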



Each subject received a score on the multiple choice and on the true-false test
to which he responded. A Pearson product-moment correlation was calculated
between subtest scores on each of the eight forms. Table 3 shows the correlation
coefficients and the coefficients corrected for attenuation. A test statistic
developed by Forsyth and Feldt (1969) was used to generate 90% confidence intervals
for the eight disattenuated coefficients. The upper and lower limits for these
intervals are depicted in Table 4. The hypothesis that the disattenuated
correlation coefficient does not differ from unity is supported in six of the
eight cases.
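
The correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities. The sketch below applies it to form SDA, assuming that the unadjusted true-false K-R20 enters the correction; that assumption reproduces the tabled coefficient to within rounding, but it is an inference from the tables rather than something the text states. The function name is hypothetical.

```python
import math


def disattenuate(r_xy, r_xx, r_yy):
    """Correlation corrected for unreliability in both measures."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Form SDA: observed r = .564 (Table 3); multiple choice K-R20 = .805 and
# original true-false K-R20 = .498 (Table 2).
corrected_sda = disattenuate(0.564, 0.805, 0.498)
print(round(corrected_sda, 3))
```

Because the divisor is itself an estimate, a corrected coefficient can exceed 1.0, as two of the entries in Table 3 do.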

Conclusions and Discussion

The results of this study support the notion that students respond to more
true-false than multiple choice items in a given period of time. However, the data
indicate that the multiple choice tests were more reliable though they tended to
measure the same thing that the true-false tests measured. These generalizations
require some cautionary remarks.

The original tests used in this study were not typical of those constructed by
classroom teachers. The items were cast to measure primarily understandings
and relationships between concepts. The reliabilities of these tests were much
higher (.90) than the reliabilities classroom teachers achieve with their
instruments. It is possible that the results of this study would be different if a
typical teacher-made multiple choice test had been used originally. The shorter
test with less discriminating items would probably yield a smaller range of
scores and, therefore, smaller reliability coefficients.

The concurrent validity data should be interpreted with some care. Two of the
eight confidence intervals failed to include unity whereas two of the eight
(NJB and NDB) were almost certain to include unity by inspection. The probability
that all eight confidence intervals included the true population value of the
corrected coefficient was 0.43.
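
The figure of 0.43 is what one obtains by treating the eight 90% intervals as independent, an assumption that is supplied here and not spelled out in the text. A one-line check:

```python
confidence = 0.90   # coverage of each interval
n_intervals = 8     # one interval per final test form

# Probability that every interval covers its population value,
# assuming the intervals are independent.
joint_coverage = confidence ** n_intervals
print(round(joint_coverage, 2))
```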

The relatively large estimated standard errors of the disattenuated correlation
coefficients (see Table 4) caused several of the confidence intervals to be
relatively wide. These large estimates were a function of half-test
reliabilities for the multiple choice and true-false tests. The median half-test
reliabilities were .730 and .431 for multiple choice and true-false tests,
respectively.

Though the data from the confidence intervals support the hypothesis that
true-false and multiple choice tests measure the same thing, the data are not
conclusive. The variability in observed correlation coefficients (see Table 3)
may be explained in terms of sampling fluctuations, yet these may not account
for all of the discrepancies.

It may be true that multiple choice and true-false tests require somewhat different
abilities of the examinees. For example, a student may mark a statement true
because he could not think of a counterexample, a situation or occurrence that
would make the proposition false. His search for a counterexample may have been
bounded by time limits or the length to which he could stretch his mind or the
depth of his retrieval system that he could penetrate. The multiple choice
item, however, limits the universe of comparisons that the individual must make.
He can decide which alternative makes a true statement with the item stem and
then review the remaining alternatives to determine if any of them is a
counterexample for the true statement. Though individuals probably differ in the
responding schemes they use, their manners of responding to true-false and multiple
choice items may depend on somewhat different abilities. The observed correlation
coefficients in this study may reflect these differences. The question then
arises, if the two item types measure different things, which one best measures
what we want to measure? If we are satisfied that our achievement test measures
relevant tasks, what suitable external criterion could be used for prediction?
When that suitable criterion is discovered we will probably use it to measure
achievement instead of our multiple choice or true-false test.

The data from this study do not provide support for those individuals who believe
that true-false items are as effective as multiple choice items for measuring
classroom achievement. Though the longer true-false tests were less reliable,
they exhibited a potential for more adequately sampling the domain of social
studies and natural science than did the multiple choice tests. Students could,
theoretically, attempt 105 true-false items in the time required to respond to
70 multiple choice items, though the former test may be somewhat less reliable.

Though the ratio remained constant (1.5), students attempted slightly fewer 
natural science than social studies items in eight minutes of testing. This 
suggests that no hard and fast rules can be formulated regarding the amount 
of time required to respond to different types of items without considering 
item content as well. 

TABLE 1

Sample Used in Three Phases of Testing

                                    PHASE
GRADE   SUBJECT               I      II     III

9       Social Studies        0       9       n
        Natural Science      49      24     100
10      Social Studies        0      42     145
        Natural Science      27      47     141
11      Social Studies       72      42     194
        Natural Science      18      35     129
12      Social Studies       30      17     260
        Natural Science       7       0      59

Total                       293     207    1018

FIGURE 1

Arrangement of Test Forms Used in Phase III

Test Form     Subtest Order

SJA           MC   TF
SJB           TF   MC
SDA           MC   TF
SDB           TF   MC
NJA           MC   TF
NJB           TF   MC
NDA           MC   TF
NDB           TF   MC

TABLE 2

K-R20 Reliabilities for Final Subtest Forms

                             Subtest
Test Form     Multiple          True-False
              Choice        Original   Adjusted

SJA            .796           .708       .785
SJB            .827           .654       .739
SDA            .805           .498       .598
SDB            .851           .641       .728
NJA            .835           .759       .825
NJB            .852           .612       .703
NDA            .854           .704       .781
NDB            .862           .645       .732

TABLE 3

Correlation Coefficients for Multiple Choice and True-False
Subtest Scores on Each Final Test Form

Test Form     Observed r     Corrected r       n

SJA             .578            .769          126
SJB             .697            .947          127
SDA             .564            .891          128
SDB             .430            .582          128
NJA             .661            .831          126
NJB             .728           1.009          129
NDA             .710            .916          125
NDB             .825           1.107          129

TABLE 4

Confidence Intervals for Disattenuated MC-TF
Correlation Coefficients

Test Form     Est. Standard     Upper      Lower
              Error             Limit*     Limit*

SJA             .0710            .937       .601
SJB             .0629           1.051       .844
SDA             .127G           1.090       .682
SDB             .0913            .733       .431
NJA             .1850           1.135       .527
NJB             .2075           1.350       .568
NDA             .1741           1.202       .620
NDB             .1750           1.395       .819

*90% confidence intervals